SOK-1005 Data science project Task 2 and onwards

Data science project spring 2023.

Author
Affiliation
Candidate number 3
Candidate number 11

UiT The Arctic University of Norway
Faculty of Biosciences, Fisheries and Economics.

Published

June 1, 2023

The files has been split into Task1 and one for Task 2, 3 and 4. This has been done after we recieved an email telling us we needed this to deliver. The Full project is the unedited version.
If you prefer to use the GitHub hosted files, then use these links
Task 1
Tasks 2, 3 and 4
Full project (Recommended)
If you have downloaded the files, then use these links
Task 1
Tasks 2, 3 and 4
Full project (Recommended)
Code
rm(list=ls()) #cleaning our environment
# Load required libraries
library(tidyverse)
library(haven)
library(curl)
library(utils)
library(janitor)
library(glue)
library(leaflet)
library(knitr)
library(plotly)
library(readr)


# Code to be used to see what type of category we get to work with
category <- c("Analgesics","Bath Soap","Beer","Bottled Juices","Cereals",
              "Cheeses","Cigarettes","Cookies","Crackers","Canned Soup",
              "Dish Detergent","Front-end-candies","Frozen Dinners","Frozen Entrees",
              "Frozen Juices","Fabric Softeners","Grooming Products","Laundry Detergents",
              "Oatmeal","Paper Towels","Soft Drinks","Shampoos","Snack Crackers",
              "Soaps","Toothbrushes","Canned Tuna","Toothpastes","Bathroom Tissues")

letter2number <- function(x) {utf8ToInt(x) - utf8ToInt("A") + 1L}
seed_number <- sum(letter2number("Daniel")) + sum(letter2number("Daniel"))
set.seed(seed_number)
cat(glue("Our seed number is {seed_number} so our category is {sample(category, 1)}")) #Making a print function to display nicer in html
Our seed number is 410 so our category is Shampoos
Code
# We free up ram by removing old dataframes we dont need to keep loaded.
rm(category, letter2number, seed_number)

We start by loading the packages we will use to do the tasks. Then we run the code that ensures we know which product category we are given with our “seed” which is Shampoo.

Task 2

We have chosen the specific brands because some of the brands left the marked and some came into the marked over the years and 1994 was the most consistent for the brands we had chosen.

We run a code that groups all the stores and sums across all their customers, then we filter out only the outlet that has the most customers, this is Dominicks 98 which we will look further into and store it in a separate dataset.

Code
tryCatch({df <- read_csv("sok-1005_data.csv")}, error = function(err) {df <<-read_csv("https://raw.githubusercontent.com/Danieljoha/Sok-1005-Data_science_project/main/sok-1005_data.csv")})

df <- df %>%
  pivot_longer(cols = -c(date, store, week, move, qty, price, profit, sales, custcoun, haba, brand, name, city, zip, lat ,long),# Specifing the columns not to pivot
               names_to = "variable_name",# Set the new column name for the values
               values_to = "dflong_value")


#lager demo_stata igjen
 
demo_stata <- df %>% 
  ungroup() %>% 
  select(-date, -brand) %>% 
  group_by(store, name, city, zip, lat, long, variable_name, dflong_value) %>%  
  summarise(across(1:8, \(x) sum(x, na.rm = FALSE)), .groups = "keep") %>% 
  ungroup() %>% 
  select(-week)



df <- df %>%
  filter(variable_name %in% c("% Population under age 9", NA)) %>% 
  select(-name, -city, -zip, -lat, -long, -variable_name, -dflong_value)
Code
# Filtering for the most relevant store to plot
dominicks_98 <- df %>% 
  group_by(store) %>% 
  summarise(total_custcoun = sum(custcoun)) %>% 
  filter(total_custcoun == max(total_custcoun)) %>% 
  left_join(df, by = "store") %>% 
  select(-total_custcoun)

# Pivoting to use facet wrap to show more plots
dominicks_98_long <- dominicks_98 %>%
  pivot_longer(cols = c(sales, haba,custcoun, profit), 
               names_to = "names", values_to = "values")
Code
# Starting the plot
dominicks_98_long %>%
  ggplot(aes(x = date, y = values, fill = names)) +
  geom_col() + # using col for bar chart
  theme_minimal()+ # theme
  labs(x = "", y = "Verdier", # changing some names
       fill = "names", labels="", title="Dominicks 98") +
  scale_fill_manual(values = c("custcoun" = "#276DB6", # Changing colors manually
                               "haba" = "#B627A5",
                               "profit" = "cornflowerblue",
                               "sales" = "purple"))+
  # facet for 4 different figures in same plot
  facet_wrap(~ names, scales="free", labeller = labeller(names = c("custcoun" = "Customer count", 
                                                                  "haba" = "Total cosmetic sales", 
                                                                  "profit" = "Profit", 
                                                                  "sales" = "Sales")))+
  # Setting some adjustments in the theme for positions of title and legend
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.4, size=20), strip.text = (element_text(size = 17)))+
  # Fixing the x-axis to show months
  scale_x_date(date_breaks = "1 month", date_labels = "%b")

Here you see weekly customer counts for Dominicks 98 which has the most customers of all the Dominicks stores in Chicago. Dominicks 98 was closed in October 1994 so we don’t have data for this month but it was still the best performing store. Further, we can see total cosmetic sales, which is the weekly turnover for shampoo in this store. Gross profit is the difference in profit between purchase price and selling price. Total cosmetic sales and gross profit are in dollars. Sales is a variable that we created via the recipe in the manual, according to the manual this shows us total dollar sales including things that are in bundles.

Code
# We ended up not using this figure, if you wondered why there is a code-fold here
#| column: page
brand_colors <- c("Bold Hold" = "#1f77b4",
                  "Head and Shoulders" = "#ff7f0e",
                  "Ivory" = "#2ca02c",
                  "Spirit" = "#d62728",
                  "Vo5" = "#9467bd")



old_fig <- dominicks_98 %>% 
  mutate(haba = haba / 1000,
         brand = case_when(
           brand == "bold_hold" ~ "Bold Hold",
           brand == "head_and_shoulders" ~ "Head and Shoulders",
           brand == "ivory" ~ "Ivory",
           brand == "spirit" ~ "Spirit",
           brand == "vo5" ~ "Vo5"
         )) %>% 
  plot_ly() %>%
  add_trace(x = ~date, y = ~haba, type = 'bar', split = ~brand, color = ~brand, colors = brand_colors,
            text = ~brand, hovertemplate = "Brand: %{text}<br>Value: %{y}<extra></extra>") %>%
  layout(title = list(text = "Dominicks 98", xref = "paper", x = 0.4),
         xaxis = list(title = "", tickformat = "%b", dtick = "M1"),
         yaxis = list(title = "Verdier i dollar(1000)"),
         legend = list(orientation = "h", x = 0.5, y = -0.1, xanchor = "center"),
         barmode = 'stack',
         annotations = list(
           list(text = "Weekly sales for shampoo", xref = "paper", yref = "paper", x = 0.4, y = 0.99, showarrow = F, font = list(size = 12))
         )) %>% 
  config(displayModeBar = FALSE)

#old_fig
Code
# Defining colors for each brand
dominicks_98_colors <- dominicks_98 %>%
  mutate(brand = case_when(
    brand == "bold_hold" ~ "Bold Hold",
    brand == "head_and_shoulders" ~ "Head and Shoulders",
    brand == "ivory" ~ "Ivory",
    brand == "spirit" ~ "Spirit",
    brand == "vo5" ~ "Vo5")
  ) %>% 
  mutate(color = brand_colors[brand]) %>%
  mutate(week = as.integer(format(date, "%W")))

dominicks_98_plot <- dominicks_98_colors %>%
  split(.$brand) %>%
  map(
    ~ ggplot(data = .x, aes(x = week, y = haba / 1000, fill = brand,
                     text = paste0("Week: ", week, "<br>",
                     "Brand: ", brand, "<br>",
                     "Health and beauty sales in thousands: <br>$", haba / 1000))) +
      geom_col() +
      scale_fill_manual(values = unique(.x$color)) +
      theme_minimal() +
      labs(x = "Week", y = "Dollar sales for each brand (1000)") +
      scale_x_continuous(expand = c(0, 0), breaks = seq(1, 52, by = 1), labels = seq(1, 52, by = 1)) +
      theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7), plot.title = element_text(hjust = 0.4)) +
      ggtitle(paste(unique(.x$brand)))
  )


# Convert ggplot objects to Plotly plots
dominicks_98_plotly <- dominicks_98_plot %>%
  map(~ggplotly(.x, tooltip = "text"))


# Print each modified plot
dominicks_98_plotly[[1]]
dominicks_98_plotly[[2]]
dominicks_98_plotly[[3]]
dominicks_98_plotly[[4]]
dominicks_98_plotly[[5]]
Code
brand_list <- c("Bold Hold", "Head and Shoulders", "Ivory", "Spirit", "Vo5")
brand_map <- c("bold_hold" = "Bold Hold", 
               "head_and_shoulders" = "Head and Shoulders",
               "ivory" = "Ivory",
               "spirit" = "Spirit",
               "vo5" = "Vo5")

for (brand in brand_list) {
  original_brand_name <- names(brand_map)[brand_map == brand] # get the original brand name
  
  brand_data <- dominicks_98 %>% 
    mutate(haba = haba / 1000,
           brand = case_when(
             brand == original_brand_name ~ brand
           )) %>%
    filter(brand == brand)
  
  plot_ly(brand_data) %>%
    add_trace(x = ~date, y = ~haba, type = 'bar', color = ~brand, colors = brand_colors[brand],
              text = ~brand, hovertemplate = paste("Brand: ", brand, "<br>Value: %{y}<extra></extra>")) %>%
    layout(title = list(text = paste("Dominicks 98 - ", brand), xref = "paper", x = 0.4),
           xaxis = list(title = "", tickformat = "%b", dtick = "M1"),
           yaxis = list(title = "Verdier i dollar(1000)"),
           legend = list(orientation = "h", x = 0.5, y = -0.1, xanchor = "center"),
           annotations = list(
             list(text = paste("Weekly sales for", brand), xref = "paper", yref = "paper", x = 0.4, y = 0.99, showarrow = F, font = list(size = 12))
           )) %>% 
    config(displayModeBar = FALSE)
}

In figure 2, we look at which brand sells best among the shampoos we have selected with a lot of movement. You can see that Vo5 clearly has the largest sales volume measured in dollars, and that Head and Shoulders is not as big as they are today.

Code
# List of variables of interest
variables2 <- list(c('% of population that is white',
                     '% of population that is neither black & hispanics or white', 
                     '% Blacks & Hispanics'),
                  c("% of people that arent college graduates", "% College Graduates"),
                  c("% of population with income under $15,000", "% of population with income over $15,000"),
                  c("% of households without mortages", "% of households with mortgages"))

# List to store the resulting datasets
datasets <- list()

# Loop through the variables and create the corresponding datasets
for (i in seq_along(variables2)) {
  datasets[[i]] <- demo_stata %>%
    filter(store == "98") %>%
    unique() %>%
    filter(variable_name %in% variables2[[i]])
}

In figure 3, the group management can see which customer group shops the most at Dominicks 98, those who contribute the most to sales in this particular store are 81% white, 1.2% other, and 17.8% black or Latin American. Further in figures 4, 5 and 6, we look at whether the customers are students, above the poverty line, and whether the customers have mortgages.

Code
variable_names <- list(c('% of population that is white',
                         '% of population that is neither black & hispanics or white', 
                         '% Blacks & Hispanics'),
                       c("% of people that arent college graduates", "% College Graduates"),
                       c("% of population with income under $15,000", "% of population with income over $15,000"),
                       c("% of households without mortages", "% of households with mortgages"))

create_pie_chart <- function(dataset, variable_names) {
  pie_data <- dataset %>%
    filter(variable_name %in% variable_names) %>%
    mutate(display_value = round(dflong_value * 100, 2),
           custom_text = paste0(variable_name, ": ", display_value, "%"))  
  
  colors <- c('blueviolet', '#B627A5', 'cornflowerblue')
  
  pie_chart <- plot_ly(pie_data, labels = ~variable_name, values = ~display_value, type = 'pie', textinfo = 'percent', 
                       marker = list(colors = colors), text = ~custom_text, hovertemplate = "%{label}: <br>%{value}%<extra></extra>") %>%
    layout(title = "", 
           legend = list(orientation = "h", x = 0.5, y = -0.3, xanchor = "center")) %>% 
    config(displayModeBar = FALSE)

  return(pie_chart)
}



plots <- pmap(list(datasets, variable_names), create_pie_chart)
Code
 plots[[1]]
 plots[[2]]
 plots[[3]]
 plots[[4]]

Task 3

It is important to note that because it is weekly data we have redefined our week 225 to start at 1994. Because we are using monthly data now it shouldn’t make a large difference but the data now includes the final 4 days from 1993 and doesnt include the last 4 days of 1994. This means January had a few days of data from December 1993, February has a few days of January and so on.

Code
#CURSED

week_225_start <- as.Date("1994-01-01")#Overwriting start date


df_monthly <- df %>%
  ungroup() %>% 
  mutate(date = week_225_start + (week - 225) * 7) %>% 
  mutate(month = month(date)) %>% 
  select(-date, -week) %>% 
  group_by(month, brand) %>% 
  summarise_all(sum)

First, we will present monthly values for all Dominick’s stores from the entire chain. We can see these in the figure below.

Code
df_monthly %>%
  pivot_longer(cols = c(sales, haba,custcoun, profit), 
               names_to = "names", values_to = "values") %>% 
  ggplot(aes(x = month, y = values, fill = names)) +
  geom_col() + # using col for bar chart
  theme_minimal()+ # theme
  labs(x = "", y = "Verdier", # changing some names
       fill = "names", labels="", title="All Dominicks stores. Monthly values") +
  scale_fill_manual(values = c("custcoun" = "#276DB6", # Changing colors manually
                               "haba" = "#B627A5",
                               "profit" = "cornflowerblue",
                               "sales" = "purple"))+
  # facet for 4 different figures in same plot
  facet_wrap(~ names, scales="free", labeller = labeller(names = c("custcoun" = "Customer count", 
                                                                  "haba" = "Total cosmetic sales", 
                                                                  "profit" = "Profit", 
                                                                  "sales" = "Sales")))+
  # Setting some adjustments in the theme for positions of title and legend
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.4, size=20),strip.text = element_text(size = 15))+
  scale_x_continuous(expand=c(0,0), n.breaks = 12)

Next, we look at which brands are doing well. As in the previous task, we can see that Vo5 is selling the best across all of Chicago. NB: The y-axis has different scales on the different plots.

Code
# Define colors for each brand
brand_colors <- c("Bold Hold" = "#1f77b4",
                  "Head and Shoulders" = "#ff7f0e",
                  "Ivory" = "#2ca02c",
                  "Spirit" = "#d62728",
                  "Vo5" = "#9467bd")


df_monthly_with_colors <- df_monthly %>%
  mutate(brand = case_when(
           brand == "bold_hold" ~ "Bold Hold",
           brand == "head_and_shoulders" ~ "Head and Shoulders",
           brand == "ivory" ~ "Ivory",
           brand == "spirit" ~ "Spirit",
           brand == "vo5" ~ "Vo5")) %>% 
  # Add a color column based on brand
  mutate(color = brand_colors[brand])

plots_monthly <-df_monthly_with_colors %>%
    split(.$brand) %>%
    map(~ ggplot(data = .x, aes(x = month, y = haba/1000, fill = color)) +
          geom_col() +
          scale_fill_identity() +
          theme_minimal() +
          scale_x_continuous(expand = c(0,0), n.breaks = 12) +
          labs(x="",y="Dollar sales for each brand(1000)")+
          theme(plot.title = element_text(hjust = 0.4))+
          ggtitle(paste(unique(.x$brand))))

# Print each plot
invisible(map(plots_monthly, print))

Task 4

Code
loc_data <- demo_stata[!duplicated(demo_stata[,c("store")]),]

loc_data <- loc_data %>% 
  select(-dflong_value, -variable_name)

loc_data %>% 
  mutate(lat = lat/10000, long = long/10000*-1) %>% 
  leaflet() %>% 
  addTiles(options = tileOptions(minZoom=10, maxZoom=13)) %>% 
  addCircleMarkers(clusterOptions = markerClusterOptions(maxClusterRadius = 20),
                   lat = ~lat, lng = ~long, #locations for markers
                   radius = 12, #circle size
                   color = "red", #color of circle
                   fillOpacity = 0.5,
                   stroke = FALSE, #no outline
                   label = ~as.character(store), #store labels
                   labelOptions = labelOptions(noHide = TRUE, direction = "center", textOnly = TRUE, fontSize = 16, fontWeight = "bold"), 
                   popup = paste0("<strong>Store:</strong>", loc_data$store, 
                                  "<br><strong>Name:</strong>", loc_data$name, 
                                  "<br><strong>City:</strong>", loc_data$city, 
                                  "<br><strong>Zip:</strong>", loc_data$zip,
                                  "<br><strong>Profit from shampoo sales:</strong> $", loc_data$profit,
                                  "<br><strong>Total shampoo sales:</strong> $", loc_data$sales,
                                  "<br><strong>Customer count</strong>(thousands): ", round(loc_data$custcoun/1000),
                                  "<br><strong>Health and beauty sales</strong>(thousands): ", round(loc_data$haba/1000)))

The data we have can be used to find new and suitable retail outlets. What we can do, for example, is to look at the best retail outlets we already have over time in aggregate, then take a regression analysis of, for example, the 5 or 10 best places and see if the customer flow has changed positively over time. If we then manage to find a location that is popular and has an increasing customer group, we can see if there are any locations nearby where we can set up a new store to cater to the increasing customer flow.

Code
#making an empty list
model_list <- list()

#looping to add to the lsit
for (store_id in unique(df$store)) {
  
  #Subset the data for the current store
  store_data <- subset(df, store == store_id)
  
  #fitting the model
  model <- lm(custcoun ~ date, data = store_data)
  
  #adding the model and coefficients into the list
  model_list[[as.character(store_id)]] <- list(model = model, coef = coef(model))
  
}

#Getting the coefficients and r-squared values for all stores
coef_df <- data.frame(do.call(rbind, lapply(model_list, function(x) x$coef)))

Shows top 10 stores with increasing customer trend.

Code
table <- coef_df %>%
  arrange(-date) %>%
  rownames_to_column(var = "store") %>% 
  mutate(store = as.numeric(store))

table_names <- table %>% 
  rename("Intercept" = X.Intercept., Coefficient = date) %>% 
  slice_head(n = 10)

table_names
store Intercept Coefficient
8 -1595356.1 192.01665
80 -1569729.5 187.34824
78 -970100.1 116.64968
107 -675057.0 90.44691
86 -481841.6 65.11091
129 -399563.1 57.46547
84 -379781.5 54.01417
139 -291179.8 48.66049
77 -302744.0 44.45087
137 -244790.8 40.76657
Code
# Plotting the most promising store
df %>%
  filter(store == 8) %>%
  ggplot(aes(x=date,y=custcoun)) +
  geom_point()+
  geom_smooth(method=lm, se=FALSE)+
  labs(x="", y="Customers", title="Dominicks 8 customers regression")+
  theme_minimal()+
  scale_x_date(date_breaks = "1 month", date_labels = "%b")

The results from the regression analysis show that store 8 has the most increasing customer flow. If we look at the map above, this location has good coverage of Dominicks stores. What we should be able to see is what this store does better than the others we have in the area already, and if a new store is to be set up, it should be further east. Otherwise, when we look at the map above, the location that could have more stores is West Chicago, for example with Winfield.

Code
df %>%
  filter(store == 74) %>%
  ggplot(aes(x=date,y=custcoun)) +
  geom_point()+
  geom_smooth(method=lm, se=FALSE)+
  labs(x="", y="Customers", title="Dominicks 74 customers regression")+
  theme_minimal()+
  scale_x_date(date_breaks = "1 month", date_labels = "%b")

Dominicks 74 is the store with the most declining customers.

Code
table <- coef_df %>%
  arrange(-date) %>%
  rownames_to_column(var = "store") %>% 
  mutate(store = as.numeric(store))


palette <- colorNumeric(palette = c("red", "green"), domain = loc_data$sales)



loc_data %>% 
  mutate(lat = lat/10000, long = long/10000*-1) %>% 
  leaflet() %>% 
  addTiles(options = tileOptions(minZoom=9, maxZoom=13)) %>% 
  addCircleMarkers(clusterOptions = markerClusterOptions(maxClusterRadius = 20),
                   lat = ~lat, lng = ~long, #locations for markers
                   radius = 12, #circle size
                   color = ~palette(sales), #color of circle based on sales
                   fillOpacity = 0.5,
                   stroke = FALSE, #no outline
                   label = ~as.character(store), #store labels
                   labelOptions = labelOptions(noHide = TRUE, direction = "center", textOnly = TRUE, fontSize = 16, fontWeight = "bold"), 
                   popup = paste0("<strong>Store:</strong>", loc_data$store, 
                                  "<br><strong>Name:</strong>", loc_data$name, 
                                  "<br><strong>City:</strong>", loc_data$city, 
                                  "<br><strong>Zip:</strong>", loc_data$zip,
                                  "<br><strong>Profit from shampoo sales:</strong> $", loc_data$profit,
                                  "<br><strong>Total shampoo sales:</strong> $", loc_data$sales,
                                  "<br><strong>Customer count</strong>(thousands): ", round(loc_data$custcoun/1000),
                                  "<br><strong>Health and beauty sales</strong>(thousands): ", round(loc_data$haba/1000))) %>% 
  addLegend(pal = palette, values = ~sales, title = "Shampoo total<br> dollar sales ($)")

On this interactive map, we can finally see that it is primarily the central Chicago stores that are doing the best and those on the outskirts that are doing the worst. Our conclusion is that it is in those areas in the middle of Chicago with the least coverage where it will be most profitable to set up a new store or multiple new stores. And by running a code that we had help from ChatGPT to see there might be some form of correlation. However for the correlation numbers it is important to note that statsistical significance is important when interpreting correlations, the correlation may be not zero but if its not statistically significant then it may not be meaningful to look at. We have not used any form of statistical test as we lack the knowledge to do this in a meaningful way but we felt it may be relevant to look at.

Code
correlations <- demo_stata %>% 
  pivot_wider(names_from = variable_name, values_from = dflong_value) 

#This code is with the use of ChatGPT
# List of demographic columns
demographic_cols <- colnames(correlations)[5:6]

# Initialize an empty vector to store correlation results
correlation_results <- c()
#This code is with the use of ChatGPT
# Calculate correlation for each demographic variable
for (col in demographic_cols) {
  correlation <- cor(correlations$sales, correlations[[col]], use = "pairwise.complete.obs")
  correlation_results <- c(correlation_results, correlation)
}

# Combine demographic column names and correlation results into a dataframe
correlation_df <- data.frame(Demographic = demographic_cols, Correlation = correlation_results)
#This code is with the use of ChatGPT
# Print the correlation dataframe
correlation_df
Demographic Correlation
lat 0.1445141
long -0.1655860

This shows us that there is maybe a good idea to have stores with a higher value of latitude which is to the north, and a lower longtitude which is to the east so having your store in the north east chigago area might be something to look into.

Code
correlations <- demo_stata %>% 
  pivot_wider(names_from = variable_name, values_from = dflong_value) 


#This code is with the use of ChatGPT
# List of demographic columns
demographic_cols <- colnames(correlations)[14:ncol(correlations)]

# List of variables for which you want to calculate correlations
variables <- c("sales", "profit", "custcount", "haba")

# Initialize an empty dataframe to store correlation results
correlation_df <- data.frame()
#This code is with the use of ChatGPT
# Calculate correlation for each demographic variable
for (var in variables) {
  correlation_results <- c()
  for (col in demographic_cols) {
    # Ensure both columns are numeric before calculating correlation
    if (is.numeric(correlations[[var]]) & is.numeric(correlations[[col]])) {
      correlation <- cor(correlations[[var]], correlations[[col]], use = "pairwise.complete.obs")
      correlation_results <- c(correlation_results, correlation)
    } else {
      correlation_results <- c(correlation_results, NA)
    }
  }
  temp_df <- data.frame(Demographic = demographic_cols, Correlation = correlation_results, Variable = var)
  correlation_df <- rbind(correlation_df, temp_df)
}

# Subset the dataframe to keep only rows with correlation greater than 0.1 or less than -0.1. #This code is with the use of ChatGPT
correlation_df <- subset(correlation_df, abs(Correlation) > 0.1)
#This code is with the use of ChatGPT

#rownames(correlation_df) <- NULL
# Print the correlation dataframe
correlation_df
Demographic Correlation Variable
5 % Working Women with full-time jobs -0.1284505 sales
8 % of Retired 0.1091721 sales
10 % of Unemployed 0.1189193 sales
15 % of households with mortgages -0.2543562 sales
16 % of households without mortages 0.2543562 sales
21 % of population that is neither black & hispanics or white 0.1850833 sales
26 % of women that arent working 0.1284505 sales
27 % of working women with children -0.2062318 sales
28 % of working women with children 6 - 17 -0.2004089 sales
31 % Blacks & Hispanics -0.2206697 profit
32 % College Graduates -0.1124497 profit
45 % of households with mortgages -0.1341649 profit
46 % of households without mortages 0.1341649 profit
47 % of non-working women with children 0.1022449 profit
49 % of people that arent college graduates 0.1124497 profit
52 % of population that is non-white -0.2049068 profit
53 % of population that is white 0.2049068 profit
93 % Population over age 60 -0.1037599 haba
94 % Population under age 9 0.1508498 haba
100 % of Unemployed 0.2141608 haba
102 % of households with 2 persons -0.1555487 haba

The variable “haba” means health and beauty sales.

What we can see here is that there might be some signs of there being some correlation with demographic data but we can not be sure. Further analysis is needed.

References

The data used for this assignment is gathered at the following links

https://www.chicagobooth.edu/research/kilts/datasets/dominicks

https://www.chicagobooth.edu/-/media/enterprise/centers/kilts/datasets/dominicks-dataset/dominicks-manual-and-codebook_kiltscenter

We have also used the resources from the courseplan in SOK-1005 and used the lectures given by Dejene Gizaw Kidane to help us create the code to wrangle the data so there may be similarities at places in Task 1. However any and all errors are ours and ours alone.

The lectures are found here https://github.com/uit-sok-1005-v23/uit-sok-1005-v23.github.io

Some of the code used to create some of the figures and datawrangling has been taken from our delivered assignments. The candidat numbers are added in the top of the document and we have delivered the assignments in Canvas with links to our Github pages. However for the purpose of anonymity of the authors of this task, our github pages contain our names so this should only first be checked after grading is final.

Appendix for AI usage

Task 1

In the beginning of Task 1 there has been use for ChatGPT to help define the start of “week 1” as we were unsure of how to convert the weekly data to date but it gave us the idea which we used to create the code. Codeline 136 and 146

We used ChatGPT to enter in the variable names after posting the variable names from the PDF. Codelines 157-67

In the code line 251 we had help from ChatGPT to write the REGEX code to help us pick out the brands as the brands had weird names with changing use of letters and such.

Finally we have used ChatGPT to help us translate the documentation we had made for Task1 to english by promting it with “translate to english” and adding our text. This was done to help grade the task.

Task 2

Helped make a trycatch code so if file isnt found, it is gotten from github where it is hosted

ChatGPT has been used to help create a better structure in the text we have made. This has been done by promting the model with our text and asking it to improve it. With some translations

Task 3

Minimal usage

Again used to help create better structure in the text and at times translate our norwegian comments.

Task 4

The use of ChatGPT is shown on this link and we translated one section to english in task 3.

The link shows us promting ChatGPT saying we have a dataframe in R and explaining a bit of the dataframe. The AI then gives us some code and i then specify we want specific columns to be checked. It then gives the code that is used at the end of Task 4 and it has been changed minimally.

Lastly it shows that i asked it to translate a section of Task 3 to to english which was done to help grade this task.